Tone Matters: Sentiment Classification of Support Tweets Using VADER and XGBoost

Samantha Chickeletti & Michael Alfrey (Advisor: Dr. Cohen)

2025-08-04

Introduction

  • Motivation and Context
  • VADER for Social Media Sentiment
  • Machine Learning Extension with XGBoost
  • Dataset Overview
  • Literature-Informed Methodology
  • Literature Review
  • Transition to Methods

Motivation and Context: Why Tone Matters

In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation. Understanding the emotional tone behind these messages can:

  • Improve service quality
  • Anticipate customer needs
  • Enhance the overall customer experience

Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models.

VADER for Social Media Sentiment

This project explores how VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analysis tool, can classify tone in real customer support messages (Hutto and Gilbert 2014). Its design prioritizes speed and interpretability, making it ideal for short, informal content like tweets and chat messages. VADER’s scoring mechanism is particularly sensitive to social media features such as emojis, capitalization, and punctuation, which are often critical to conveying tone in these environments (K. Barik and Misra 2024).

Machine Learning Extension with XGBoost

To build a full machine learning pipeline around VADER, we will use its sentiment scores (positive, neutral, negative) as labels and train an XGBoost classifier using TF-IDF features extracted from the message text. XGBoost is well-suited for this task because it performs efficiently with sparse, high-dimensional data and eliminates the need to hand-label messages or train a separate sentiment model from scratch.

Dataset Overview

The dataset selected for this project is the “Customer Support on Twitter” dataset from Kaggle, which contains real-world support interactions between users and brands such as Apple, Amazon, and Comcast. The messages are short, informal, and emotionally expressive—closely mirroring real-world customer support scenarios—and make the dataset ideal for sentiment analysis and predictive modeling.

Literature-Informed Methodology

Natural Language Processing (NLP) has become a vital tool for understanding customer sentiment across digital platforms. A variety of approaches have been proposed in the literature, from lexicon-based models such as VADER to machine learning methods like XGBoost. This review highlights the studies that informed the methodological design of our project.

Literature Review

  • Lexicon-Based Methods for Sentiment Analysis
  • Machine Learning for Sentiment Classification

Lexicon-Based Methods for Sentiment Analysis

  • Why VADER Works Well on Tweets
  • VADER in the Literature

Why VADER Works Well on Tweets

Lexicon-based methods remain a powerful choice for analyzing short, informal messages. VADER is particularly effective because it incorporates key linguistic signals such as:

  • Capitalization: (e.g., "AWESOME" → increases intensity)
  • Punctuation: (e.g., ! → amplifies sentiment)
  • Slang, emojis, and emoticons: (e.g., :) → amplifies sentiment)
  • Negation: (e.g., "not good" → polarity reversal)

These elements help capture the nuanced sentiment found in customer service conversations that traditional lexicon models often miss.

VADER in the Literature

Recent research continues to support and expand on VADER’s use. Barik and Misra (K. Barik and Misra 2024) evaluated an improved VADER lexicon in analyzing e-commerce reviews and emphasized its interpretability and processing speed. Chadha and Aryan (Chadha and Aryan 2023) also confirmed VADER’s reliability in sentiment classification tasks, noting its effectiveness in fast-paced business contexts. Youvan (Youvan 2024) offered a comprehensive review of VADER’s core logic, highlighting its treatment of intensifiers, negations, and informal expressions.

Machine Learning for Sentiment Classification

  • Limitations of Lexicons
  • Why We Use XGBoost

Limitations of Lexicons

While VADER is powerful, it’s limited to its predefined lexicon and rule set. To complement VADER’s labeling, we incorporate XGBoost, an efficient and scalable gradient boosting algorithm, as a supervised classifier.

Why We Use XGBoost

Lestari et al. (Lestari et al. 2025) compared XGBoost with AdaBoost for movie review classification and found XGBoost achieved higher accuracy and generalizability. Sefara and Rangata (Sefara and Rangata 2024) also found XGBoost to be the most effective model for classifying domain-specific tweets, outperforming Logistic Regression and SVM in both performance and efficiency. Lu and Schelle (Lu and Schelle 2025) demonstrated how XGBoost could be used to extract interpretable feature importance from tweet sentiment, providing a compelling case for our approach.

Pipeline

Methods

  • Preprocessing and Sentiment Labeling with VADER
  • Term Frequency–Inverse Document Frequency (TF-IDF)
  • eXtreme Gradient Boosting (XGBoost)
  • Evaluation Metrics
  • Authorship Note

Preprocessing and Sentiment Labeling with VADER

  • Text Cleaning and Normalization
  • Compound Sentiment Score Calculation
  • Sentiment Label Thresholds
  • Worked Example: VADER in Action
  • Automated Labeling for Supervised Learning

Text Cleaning and Normalization

Before applying VADER, we cleaned the raw tweet text to ensure consistency, removing URLs, user mentions, and hashtags. While VADER can handle informal text, this step improved text uniformity and prepared the data for downstream modeling.
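A minimal sketch of this cleaning step; the exact regular expressions below are our illustration, not the project's verbatim code:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions, and #hashtags from a raw tweet."""
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"@\w+", "", text)           # strip user mentions
    text = re.sub(r"#\w+", "", text)           # strip hashtags
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_tweet("@AppleSupport my phone died again... #fail https://t.co/xyz"))
```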

Compound Sentiment Score Calculation

After cleaning, we applied VADER to generate a compound sentiment score for each tweet and label tweets as Positive, Neutral, or Negative based on standardized thresholds. The compound sentiment score is computed as:

\[ \text{compound} = \frac{\sum_{i=1}^{n} s_i}{\sqrt{\left(\sum_{i=1}^{n} s_i\right)^2 + \alpha}} \]

Where \(s_i\) is the valence score for each word or token and \(\alpha\) is a normalization constant (typically set to 15) that keeps the score in \((-1, 1)\) and dampens extreme values for short texts.
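The normalization \(x / \sqrt{x^2 + \alpha}\) over the summed token valences \(x\) can be sketched directly; the per-token valences below are made-up illustrative values, not entries from the actual VADER lexicon:

```python
import math

def compound(valences, alpha=15):
    """VADER-style normalization of summed token valences into (-1, 1)."""
    x = sum(valences)
    return x / math.sqrt(x * x + alpha)

# hypothetical token valences for a mildly positive message
print(round(compound([1.5, 0.4, 0.6]), 3))
```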

Sentiment Label Thresholds

The final sentiment labels are then assigned using the following thresholds:

  • Positive if compound ≥ 0.05

  • Neutral if -0.05 < compound < 0.05

  • Negative if compound ≤ -0.05

This automated labeling process served as the backbone for our supervised classification model.
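The thresholds above translate directly into a labeling function:

```python
def label_sentiment(compound: float) -> str:
    """Map a VADER compound score to a label using the standard thresholds."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

print(label_sentiment(0.62), label_sentiment(-0.75), label_sentiment(0.0))
```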

Worked Example: VADER in Action

For example, a tweet reading:

“I’ve been delayed over an HOUR and STILL no response… this is ridiculous!!!”

Example VADER Scoring:

| Feature        | Detected Element      | VADER Response               | Score Impact |
|----------------|-----------------------|------------------------------|--------------|
| Capitalization | “HOUR”, “STILL”       | Increases intensity          | -0.10        |
| Punctuation    | “…” and “!!!”         | Amplifies negative sentiment | -0.25        |
| Lexicon Match  | “ridiculous”          | Strong negative valence      | -0.25        |
| Overall Tone   | Complaint/frustration | Strongly negative            | -0.15        |
| Final Compound |                       |                              | -0.75        |

This tweet produces a compound score of -0.75 and is therefore labeled Negative.

Automated Labeling for Supervised Learning

By relying on VADER instead of manual annotation, we create a foundation for downstream supervised learning. This aligns with findings by Lu and Schelle (2025), who demonstrated that VADER-labeled tweets combined with TF-IDF and XGBoost achieved performance comparable to manually labeled datasets (Lu and Schelle 2025). Next, we turn to feature extraction to transform our labeled text into numerical form suitable for machine learning.

Term Frequency–Inverse Document Frequency (TF-IDF)

  • Representing Text as Features with TF-IDF
  • Calculating Term Frequency (TF)
  • Calculating Inverse Document Frequency (IDF)
  • Final TF-IDF Weighting
  • Why TF-IDF Works Well for Sentiment Features

Representing Text as Features with TF-IDF

To convert tweets into numerical features for modeling, we employ Term Frequency–Inverse Document Frequency (TF-IDF), a technique that quantifies how important each word is within the context of both the individual tweet and the overall corpus.

Calculating Term Frequency (TF)

Term Frequency (TF) measures how often a word appears in a single tweet (treated as a document, or “domain”) relative to the total number of words in that tweet:

\[ \text{TF}_{w_n} = \frac{g_{w_n}^{d_m}}{T_{d_m}} \]

Where:
\(w_n\) is the \(n^{\text{th}}\) word in domain \(d_m\) (a tweet)
\(g_{w_n}^{d_m}\) is the number of times word \(w_n\) occurs in domain \(d_m\)
\(T_{d_m}\) is the total number of words in domain \(d_m\)

Example:
If the word delay appears twice in a 50-word tweet, its term frequency is:

\[ \text{TF}_{w_n} = \frac{2}{50} = 0.04 \]

Calculating Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) evaluates how unique or informative a word is across the full set of tweets. Common words receive lower IDF scores, while rare or distinctive words receive higher scores:

\[ \text{IDF}_{w_n} = \log\left(\frac{D}{N_{w_n}}\right) \]

Where:
\(D\) is the total number of documents (tweets) in the corpus
\(N_{w_n}\) is the number of documents that contain word \(w_n\)

Example:
If delay appears in 5 out of 500,000 tweets, its IDF will be much higher than that of hello, which may appear in 10,000 tweets.

Final TF-IDF Weighting

Finally, TF-IDF combines these two metrics to weight each word by how frequently it appears in a tweet and how rare it is across the full dataset:

\[ \text{TF-IDF}_{w_n} = \text{TF}_{w_n} \times \text{IDF}_{w_n} \]
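Using the running example (“delay” appearing twice in a 50-word tweet and in 5 of 500,000 tweets), the TF, IDF, and TF-IDF computations look like this; natural log is shown, though base 10 or base 2 also appear in the literature:

```python
import math

def tf(count_in_doc, doc_length):
    """Term frequency: occurrences of the word over total words in the tweet."""
    return count_in_doc / doc_length

def idf(n_docs, n_docs_with_word):
    """Inverse document frequency: log of total tweets over tweets containing the word."""
    return math.log(n_docs / n_docs_with_word)

def tf_idf(count_in_doc, doc_length, n_docs, n_docs_with_word):
    return tf(count_in_doc, doc_length) * idf(n_docs, n_docs_with_word)

# "delay": twice in a 50-word tweet, present in 5 of 500,000 tweets
print(round(tf_idf(2, 50, 500_000, 5), 3))
```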

Why TF-IDF Works Well for Sentiment Features

This process highlights terms that are both prominent in a tweet and distinctive across the dataset, making TF-IDF a powerful and interpretable technique for feature extraction in sentiment analysis pipelines (K. Barik and Misra 2024). With our feature matrix ready, we proceeded to modeling.

eXtreme Gradient Boosting (XGBoost)

  • Modeling with eXtreme Gradient Boosting (XGBoost)
  • Predicting Sentiment from TF-IDF Input
  • Example: Classifying a Tweet with XGBoost
  • Training via Regularized Objective Function
  • Loss and Regularization Terms
  • Efficiency and Interpretability

Modeling with eXtreme Gradient Boosting (XGBoost)

To model sentiment classifications based on TF-IDF features, we employ XGBoost (eXtreme Gradient Boosting), a scalable and regularized tree ensemble algorithm designed for both accuracy and efficiency. XGBoost builds an additive model by iteratively constructing decision trees that minimize a regularized objective function, which balances prediction accuracy with model simplicity. The objective consists of two components: a convex loss function that measures how well the model fits the data, and a regularization term that penalizes overly complex trees.

Predicting Sentiment from TF-IDF Input

Each predicted class label \(\hat{y}_i\) (positive, neutral, negative) is computed as the sum of outputs from \(K\) trees:
\[ \hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F} \]

Where:
\(x_i\): The input TF-IDF vector for tweet \(i\)
\(f_k(x_i)\): The prediction from the \(k^\text{th}\) tree for input \(x_i\)
\(\sum_{k=1}^K f_k(x_i)\): The sum of predictions for each class
\(\phi(x_i)\): The combined prediction from all trees

This formula is foundational to XGBoost: the final prediction is built up iteratively from multiple decision trees, which is the basis of boosting. For sentiment classification, the accumulated class scores are passed through a softmax function to determine class probabilities.

Example: Classifying a Tweet with XGBoost

Example:
Suppose we are using XGBoost to classify the sentiment of a tweet as positive, neutral, or negative, and the model has been trained with \(K\) = 3 boosting rounds (trees) per class.
For a new input tweet \(x_i,\) each of the 3 trees for each class outputs a score which is then summed for each class:
• Positive class score: 1.2 + 0.9 + 1.1 = 3.2
• Neutral class score: 0.5 + 0.6 + 0.3 = 1.4
• Negative class score: 0.8 + 0.7 + 0.6 = 2.1

Since the positive class has the highest total score, the model assigns the label positive.
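Converting the summed class scores above into probabilities via softmax (the scores are the illustrative values from the example, not real model outputs):

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["positive", "neutral", "negative"]
scores = [3.2, 1.4, 2.1]          # summed tree outputs per class from the example
probs = softmax(scores)
pred = classes[probs.index(max(probs))]
print(pred, [round(p, 3) for p in probs])
```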

Training via Regularized Objective Function

Once prediction scores are computed, XGBoost learns optimal tree structures by minimizing the regularized objective function, which balances prediction accuracy and model complexity:
\[ \mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \] \[ \text{where }\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \]

Loss and Regularization Terms

\(l(\hat{y}_i, y_i)\) is our differentiable convex loss function (softmax loss for multiclass classification), measuring how far the model’s prediction \(\hat{y}_i\) is from the true label \(y_i\),
\(f_k\) is the \(k^\text{th}\) decision tree in the ensemble,
\(T\): the number of leaves on a tree,
\(w\): the vector of leaf scores (weights),
\(\gamma\) and \(\lambda\): regularization parameters that control tree complexity.
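The per-tree penalty \(\Omega(f)\) can be computed directly; the values of \(\gamma\), \(\lambda\), and the leaf weights below are made-up illustrative numbers:

```python
def omega(leaf_weights, gamma=0.5, lam=1.0):
    """Tree complexity penalty: gamma * (number of leaves) + 0.5 * lambda * ||w||^2."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# a hypothetical 3-leaf tree with leaf scores 0.2, -0.1, 0.4
print(omega([0.2, -0.1, 0.4]))
```

Larger \(\gamma\) discourages trees with many leaves, while larger \(\lambda\) shrinks the leaf scores themselves.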

Efficiency and Interpretability

Therefore, by combining a strong predictive loss with a tree-specific complexity penalty, XGBoost is able to generalize well to new data, outperforming simpler models while remaining computationally efficient (Chen and Guestrin 2016). It also provides feature importance scores, offering insights into which terms most influence predictions—a valuable asset for customer service teams seeking actionable feedback.

With the model trained, we evaluated its performance using several classification metrics.

Evaluation Metrics

  • Core Performance Metrics
  • Handling Class Imbalance

Core Performance Metrics

To understand how well our model performed, we used four core metrics:

  • Accuracy: The proportion of correct predictions.
  • Precision: The proportion of correct predictions among all tweets the model labeled as a given class.
  • Recall: The proportion of actual sentiment instances that were correctly identified.
  • F1 Score: The harmonic mean of precision and recall.
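These per-class metrics can be computed from predictions directly; the toy labels below are illustrative, not drawn from our dataset:

```python
def class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, computed from scratch."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["Negative", "Negative", "Negative", "Positive", "Positive", "Neutral"]
y_pred = ["Negative", "Negative", "Positive", "Positive", "Negative", "Neutral"]
print(class_metrics(y_true, y_pred, "Negative"))
```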

Handling Class Imbalance

These metrics were selected to account for class imbalance, which is common in sentiment data sets. For instance, neutral tweets often dominate volume, while negative tweets are more operationally important in customer service. Therefore, we paid close attention to class-specific precision and recall, especially for the negative class, to ensure that frustrated customer messages were identified without over-triggering on neutral ones (R. Barik and Misra 2024; Gandy and Smith 2025).

Authorship Note

Note: Some parts of this project were assisted by ChatGPT for writing support and citation formatting. All content was reviewed and edited by the authors to ensure accuracy and originality.

Analysis and Results

Data Exploration and Visualization

  • Source: Customer Support on Twitter (Kaggle)
  • 2.8M tweets exchanged between customers and major companies
  • Fields used:
    • text
    • created_at
    • inbound (customer or company)
    • Metadata for threading (excluded from modeling)

Preprocessing & Sentiment Labeling

  • Removed: URLs, mentions, hashtags
  • Used VADER to assign compound sentiment scores
  • Sentiment Distribution (VADER Labels):
    • Positive: 51.7%
    • Neutral: 24.6%
    • Negative: 23.7%

A surprisingly large share of tweets are positive, largely due to resolution acknowledgments and polite brand replies.

Tone Differences by Sentiment

  • Positive: Thank, help, happy
  • Neutral: DM, issue, Hi
  • Negative: sorry, problem, now

Figure: Word clouds by sentiment class

TF-IDF and Feature Vectorization

Preparing for Modeling

  • Cleaned again for vectorization: lowercased, removed stopwords, special characters, URLs, mentions, hashtags
  • Applied TF-IDF (n-grams up to 2 words, 5,000 features)
  • Result: sparse matrix well-suited for XGBoost

TF-IDF highlights important and rare terms across tweets

TF-IDF and Feature Vectorization

Top 20 Most Informative Terms

  • TF-IDF helps us identify which words are not just common, but actually important for telling tweets apart. It’s like finding the loudest voices in a crowded room.
  • Words like ‘dm’, ‘help’, ‘thanks’, ‘sorry’, and ‘account’ highlight the nature of support conversations—many of which are requests for assistance, apologies, or follow-ups.
  • These high-weighted features help the XGBoost model detect tone and intent without needing deep semantic understanding. For example, the presence of words like ‘sorry’ and ‘delay’ may signal negative sentiment, while ‘thanks’ or ‘hi’ may suggest a positive or neutral interaction.

Modeling Pipeline

  • Classifier: XGBoost
  • Training data: 150,000 tweets (50k per class, stratified)
  • Test set: 562,355 tweets
  • Applied class weighting to improve recall for Negative tweets
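One common weighting scheme, shown here as a sketch that may differ from the exact scheme we used, scales each class inversely to its frequency:

```python
from collections import Counter

def balanced_weights(labels):
    """Per-class weight = n_samples / (n_classes * class_count),
    mirroring scikit-learn's 'balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# toy label distribution: rare Negative tweets get the largest weight
labels = ["Positive"] * 3 + ["Neutral"] * 2 + ["Negative"]
print(balanced_weights(labels))
```

Each training tweet then carries its class weight as a sample weight, pushing the booster to pay more attention to underrepresented (here, Negative) examples.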

Overall Performance

Final Model Results

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 77.1%  |
| Precision | 80.96% |
| Recall    | 77.1%  |
| F1 Score  | 77.45% |

Balanced performance, with emphasis on improving recall for negative sentiment

Class-Level Performance

| Sentiment | Precision | Recall | F1 Score |
|-----------|-----------|--------|----------|
| Negative  | 0.74      | 0.68   | 0.71     |
| Neutral   | 0.62      | 0.95   | 0.75     |
| Positive  | 0.93      | 0.73   | 0.82     |


  • High recall for Neutral (templated replies)

  • High precision for Positive

  • Improved recall for Negative from 64% → 68%

A key goal was identifying dissatisfaction more reliably; sample weighting improved recall on Negative tweets from 64% to 68%.

Key Takeaways

Business Impact

  • Flags high-risk conversations in real time
  • Tracks service quality over time
  • Enhances escalation workflows

Limitations:

  • Trained on Twitter, performance may vary on email or chat
  • No conversation context
  • Struggles with sarcasm
  • VADER lexicon is static

Future Work:

  • Add threading/context
  • Explore LLMs or deep learning
  • Expand to multilingual support

Conclusion

This project successfully demonstrated a scalable approach to sentiment classification in customer support conversations by combining VADER, TF-IDF, and XGBoost. VADER provided fast and interpretable sentiment labels tailored for informal social media language, which we used to train a high-performing supervised classifier.

Achieved:

  • 77% accuracy

  • 0.71 F1 for Negative tone

  • Fast, scalable tone detection

By automating tone detection in real-time support channels, this framework offers immediate business value. It can help teams prioritize escalations, identify service bottlenecks, and monitor agent interactions at scale. Our findings confirm that interpretable, rule-based sentiment scoring (via VADER) can be successfully integrated with machine learning to support responsive, tone-aware customer engagement.

References

Barik, Kanhu, and Sanghamitra Misra. 2024. “Analysis of Customer Reviews with an Improved VADER Lexicon Classifier.” Journal of Big Data 11: 10. https://doi.org/10.1186/s40537-023-00861-x.
Barik, Rakesh, and A. Misra. 2024. “iVADER: Improving Rule-Based Sentiment Analysis for Informal Text.” Journal of Computational Linguistics.
Chadha, R., and C. Aryan. 2023. “A Study Analyzing an Innovative Approach to Sentiment Analysis with VADER.” Journal of Engineering Design and Analysis 6 (1): 23–27.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. ACM. https://doi.org/10.1145/2939672.2939785.
Gandy, Jordan, and Taylor Smith. 2025. “Public Sentiment Analysis on YouTube Comments.” Journal of Social Media Research.
Hutto, C. J., and Eric Gilbert. 2014. “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.” In Proceedings of the International AAAI Conference on Web and Social Media, 8:216–25. 1. https://doi.org/10.1609/icwsm.v8i1.14550.
Lestari, IGA N., NMRM Dewi, KG Meiliana, and IKAA Aryanto. 2025. “Effectiveness of AdaBoost and XGBoost Algorithms in Sentiment Analysis of Movie Reviews.” Journal of Applied Informatics and Computing 9 (2): 258–64. https://doi.org/10.30871/jaic.v9i2.9077.
Lu, Z., and A. Schelle. 2025. “Sentiment Analysis of Tesla Tweets: Leveraging XGBoost for Social Media Insights.” 48. Constructor University Technical Reports. https://nbn-resolving.org/urn:nbn:de:gbv:579-opus-1013041.
Sefara, T., and M. Rangata. 2024. “Domain-Specific Sentiment Analysis of Tweets Using Machine Learning Methods.” In Proceedings of the 17th International Conference on Web Information Systems and Technologies (WEBIST), 468–82. https://doi.org/10.1007/978-3-031-48858-0_37.
Youvan, D. 2024. “Understanding Sentiment Analysis with VADER: A Comprehensive Overview and Application.” https://doi.org/10.13140/RG.2.2.33567.98726.